Quora is a place to gain and share knowledge—about anything. It’s a platform to ask questions and connect with people who contribute unique insights and quality answers. This empowers people to learn from each other and to better understand the world.
Over 100 million people visit Quora every month, so it's no surprise that many people ask similarly worded questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question, and make writers feel they need to answer multiple versions of the same question. Quora values canonical questions because they provide a better experience to active seekers and writers, and offer more value to both of these groups in the long term.
- Data will be in a file Train.csv
- Train.csv contains 6 columns: id, qid1, qid2, question1, question2, is_duplicate
- Size of Train.csv - 60MB
- Number of rows in Train.csv = 404,290
"id","qid1","qid2","question1","question2","is_duplicate" "0","1","2","What is the step by step guide to invest in share market in india?","What is the step by step guide to invest in share market?","0" "1","3","4","What is the story of Kohinoor (Koh-i-Noor) Diamond?","What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?","0" "7","15","16","How can I be a good geologist?","What should I do to be a great geologist?","1" "11","23","24","How do I read and find my YouTube comments?","How can I see all my Youtube comments?","1"
It is a binary classification problem: for a given pair of questions, we need to predict whether or not they are duplicates.
Source: https://www.kaggle.com/c/quora-question-pairs#evaluation
Metric(s): Log-Loss (the competition's evaluation metric).
We build the train and test sets by randomly splitting the data in a 70:30 or 80:20 ratio; either works, since we have enough data points.
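Since every model below is scored on log-loss, it helps to see what the metric actually computes. A minimal, stdlib-only sketch of binary log-loss on toy values (not the competition's actual scoring code):

```python
import math

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

print(binary_log_loss([1, 0], [0.9, 0.2]))  # ~0.1643: confident, correct predictions score low
```

Confident wrong predictions are penalized heavily, which is why the calibrated probability outputs (via `CalibratedClassifierCV` later on) matter more than raw class predictions.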
# For faster computations using numpy array
import numpy as np
# For maintaining data in dataframes
import pandas as pd
# For plotting graphs
import matplotlib.pyplot as plt
# For visualizing and plotting stats
import seaborn as sbrn
# For extracting advanced features
!pip install fuzzywuzzy
from fuzzywuzzy import fuzz
# For calculating lcs value
!pip install distance
import distance
# For getting stopwords
import nltk
nltk.download('stopwords')
nltk.download('punkt')
# For using stopwords in pre-processing
from nltk.corpus import stopwords
# For converting text into tokens
from nltk.tokenize import word_tokenize
# For stemming
from nltk.stem import PorterStemmer
# For html parsing
from bs4 import BeautifulSoup
# Useful in preprocessing
import re
# For making wordcloud
from wordcloud import WordCloud
# For standardizing data
from sklearn.preprocessing import MinMaxScaler
# For visualizing data in lower dimensions
from sklearn.manifold import TSNE
# For plotting 3-D Plot
import plotly.offline as offline
import plotly.graph_objs as go
offline.init_notebook_mode()
# For feature engineering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
# For word embeddings
import spacy
# For scores, splitting and other stuff
from sklearn import model_selection
# For operating on sparse data
from scipy.sparse import coo_matrix, hstack
# For using logistic regression
from sklearn.linear_model import LogisticRegression
# For using an SVM classifier (SGD with hinge loss)
from sklearn import linear_model
# For hyperparameter tuning
from sklearn.model_selection import GridSearchCV
# For getting probability scores
from sklearn.calibration import CalibratedClassifierCV
# For plotting confusion matrix
from sklearn.metrics import confusion_matrix
# For metrics evaluation
from sklearn.metrics import accuracy_score, log_loss
# For using xgboost classifier
import xgboost as xgb
# For reading data from a file present in google drive
from google.colab import drive
# For tracking the progress of the execution
from tqdm.notebook import tqdm
# For ignoring warnings
import warnings
drive.mount('/content/drive/');
warnings.filterwarnings('ignore');
totalData = pd.read_csv('drive/My Drive/train.csv');
usingData = totalData.copy()
usingData.head(5)
usingData.info()
As the output above shows, there are 2 null values in question2 and 1 null value in question1.
# Getting null data points
usingData[usingData.isnull().any(axis=1)]
# Replacing nan points
usingData.fillna('', inplace = True);
usingData.shape
usingData.iloc[105780]
usingData.groupby('is_duplicate')['id'].count().plot.bar()
print("Total number of question pairs: ", usingData.shape[0]);
print("Percentage of non-duplicate question pairs ", (usingData[usingData['is_duplicate'] == 0].shape[0] / usingData.shape[0])*100, '%');
print("Number of duplicate question pairs ", (usingData[usingData['is_duplicate'] == 1].shape[0] / usingData.shape[0])*100, '%');
questionIds = pd.Series(usingData['qid1'].tolist() + usingData['qid2'].tolist());
uniqueQuestions = np.unique(questionIds);
numUniqueQuestionsRepeated = np.sum(questionIds.value_counts() > 1)
print('Number of unique questions: ', len(uniqueQuestions));
print('Number of unique questions repeated: ', numUniqueQuestionsRepeated);
numDuplicatePairs = usingData[['qid1', 'qid2', 'is_duplicate']].groupby(['qid1', 'qid2']).count().reset_index();
print("Number of duplicate pairs: ", len(numDuplicatePairs) - len(numDuplicatePairs));
plt.figure(figsize=(20, 10))
plt.hist(questionIds.value_counts(), bins = 160);
plt.yscale("log", nonposy = 'clip');
plt.title("Histogram of question appearance counts(log-histogram)");
plt.xlabel("Number of occurences of questions");
plt.ylabel("Log-Number");
# processingData = usingData.iloc[0:100000]
# processingData.shape
processingData = usingData.copy()
Let us now construct a few basic features:
- freq_qid1 / freq_qid2: how often each question id occurs in the dataset
- q1len / q2len: character lengths of the two questions
- q1_n_words / q2_n_words: word counts of the two questions
- word_common, word_total, word_share: shared-word statistics
- sum_freq, diff_freq: sum and absolute difference of the qid frequencies
processingData['freq_qid1'] = processingData.groupby('qid1')['qid1'].transform('count');
processingData['freq_qid2'] = processingData.groupby('qid2')['qid2'].transform('count');
processingData['q1len'] = processingData.apply(lambda dataPoint: len(dataPoint['question1']), axis = 1);
processingData['q2len'] = processingData.apply(lambda dataPoint: len(dataPoint['question2']), axis = 1);
processingData['q1_n_words'] = processingData.apply(lambda dataPoint: len(dataPoint['question1'].split()), axis = 1);
processingData['q2_n_words'] = processingData.apply(lambda dataPoint: len(dataPoint['question2'].split()), axis = 1);
def compute_word_common(dataPoint):
return float(len(set(dataPoint['question1'].lower().strip().split()) & set(dataPoint['question2'].lower().strip().split())))
def compute_word_total(dataPoint):
return float(len(dataPoint['question1'].lower().strip().split() + dataPoint['question2'].lower().strip().split()))
def compute_word_share(dataPoint):
return compute_word_common(dataPoint) / compute_word_total(dataPoint);
processingData['word_common'] = processingData.apply(lambda dataPoint: compute_word_common(dataPoint), axis = 1);
processingData['word_total'] = processingData.apply(lambda dataPoint: compute_word_total(dataPoint), axis = 1);
processingData['word_share'] = processingData.apply(lambda dataPoint: compute_word_share(dataPoint), axis = 1);
processingData['sum_freq'] = processingData.apply(lambda dataPoint: dataPoint['freq_qid1'] + dataPoint['freq_qid2'], axis = 1);
processingData['diff_freq'] = processingData.apply(lambda dataPoint: abs(dataPoint['freq_qid1'] - dataPoint['freq_qid2']), axis = 1);
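As a quick sanity check, the word-overlap logic above can be mirrored on a hypothetical question pair (same set-intersection computation as compute_word_common / compute_word_share, on toy strings):

```python
q1 = "how do i learn python"
q2 = "how can i learn python fast"

w1 = set(q1.lower().strip().split())
w2 = set(q2.lower().strip().split())

word_common = float(len(w1 & w2))  # words appearing in both questions
word_total = float(len(q1.lower().strip().split() + q2.lower().strip().split()))
word_share = word_common / word_total

print(word_common, word_total, round(word_share, 3))  # 4.0 11.0 0.364
```

Note that word_share is the common-word count over the *total* (not unique) word count, so its maximum for identical questions is 0.5, not 1.0.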
processingData.head(4)
plt.figure(figsize=(16, 8));
plt.subplot(1, 2, 1);
sbrn.violinplot(x = 'is_duplicate', y = 'word_share', data = processingData[0:]);
plt.title('Box plot of word_share');
plt.subplot(1, 2, 2);
sbrn.distplot(processingData[processingData['is_duplicate'] == 1]['word_share'][0:], label = '1', color = 'green');
sbrn.distplot(processingData[processingData['is_duplicate'] == 0]['word_share'][0:], label = '0', color = 'red');
plt.title('Distribution plot of word_share');
plt.show();
plt.figure(figsize=(16, 8));
plt.subplot(1, 2, 1);
sbrn.violinplot(x = 'is_duplicate', y = 'word_common', data = processingData[0:]);
plt.title('Box plot of word_common');
plt.subplot(1, 2, 2);
sbrn.distplot(processingData[processingData['is_duplicate'] == 1]['word_common'][0:], label = '1', color = 'green');
sbrn.distplot(processingData[processingData['is_duplicate'] == 0]['word_common'][0:], label = '0', color = 'red');
plt.title('Distribution plot of word_common');
plt.show();
Preprocessing:
SAFE_DIV = 0.0001
stopWords = set(stopwords.words('english'))
def preprocess(text):
    text = str(text).lower()
    # Strip html tags first, while the angle brackets are still intact
    text = BeautifulSoup(text, 'html.parser').get_text()
    # Expand contractions and normalize currency / number symbols
    text = text.replace(",000,000", "m").replace(",000", "k").replace("′", "'").replace("’", "'")\
    .replace("won't", "will not").replace("cannot", "can not").replace("can't", "can not")\
    .replace("n't", " not").replace("what's", "what is").replace("it's", "it is")\
    .replace("'ve", " have").replace("i'm", "i am").replace("'re", " are")\
    .replace("he's", "he is").replace("she's", "she is").replace("'s", " own")\
    .replace("%", " percent ").replace("₹", " rupee ").replace("$", " dollar ")\
    .replace("€", " euro ").replace("'ll", " will")
    text = re.sub(r"([0-9]+)000000", r"\1m", text)
    text = re.sub(r"([0-9]+)000", r"\1k", text)
    # Remove non-word characters
    text = re.sub(r'\W', ' ', text)
    # Stem token by token - stemming the whole string would only stem its last word
    stemmer = PorterStemmer()
    text = ' '.join(stemmer.stem(word) for word in text.split())
    return text
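A couple of the cleaning rules above are easy to verify in isolation (toy input; the regexes are the same ones used in preprocess):

```python
import re

text = "what's the gdp? it grew by 5000000 dollars"
text = text.replace("what's", "what is")          # expand the contraction
text = re.sub(r"([0-9]+)000000", r"\1m", text)    # 5000000 -> 5m
text = re.sub(r"([0-9]+)000", r"\1k", text)       # no-op here; 5m has no trailing 000
print(text)  # what is the gdp? it grew by 5m dollars
```

The order matters: the millions rule must run before the thousands rule, otherwise "5000000" would become "5000k" instead of "5m".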
processingData["question1"] = processingData["question1"].fillna("").apply(preprocess)
processingData["question2"] = processingData["question2"].fillna("").apply(preprocess)
preProcessedQuestions1WithStopWords = [];
preProcessedQuestions1WithOutStopWords = [];
preProcessedQuestions2WithStopWords = [];
preProcessedQuestions2WithOutStopWords = [];
for i, dataPoint in tqdm(processingData.iterrows()):
    # preprocess returns a single cleaned string; derive the without-stopwords
    # variant by filtering its tokens against the stopword set
    preProcessedText = preprocess(dataPoint['question1']);
    preProcessedQuestions1WithStopWords.append(preProcessedText);
    preProcessedQuestions1WithOutStopWords.append(' '.join(word for word in preProcessedText.split() if word not in stopWords));
    preProcessedText = preprocess(dataPoint['question2']);
    preProcessedQuestions2WithStopWords.append(preProcessedText);
    preProcessedQuestions2WithOutStopWords.append(' '.join(word for word in preProcessedText.split() if word not in stopWords));
Definition:
Features:
ctc_min : Ratio of common_token_count to the minimum token count of Q1 and Q2
ctc_min = common_token_count / min(len(q1_tokens), len(q2_tokens))
ctc_max : Ratio of common_token_count to the maximum token count of Q1 and Q2
ctc_max = common_token_count / max(len(q1_tokens), len(q2_tokens))
last_word_eq : Check if the last word of both questions is equal or not
last_word_eq = int(q1_tokens[-1] == q2_tokens[-1])
first_word_eq : Check if the first word of both questions is equal or not
first_word_eq = int(q1_tokens[0] == q2_tokens[0])
abs_len_diff : Abs. length difference
abs_len_diff = abs(len(q1_tokens) - len(q2_tokens))
mean_len : Average Token Length of both Questions
mean_len = (len(q1_tokens) + len(q2_tokens))/2
fuzz_ratio : https://github.com/seatgeek/fuzzywuzzy#usage
http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
fuzz_partial_ratio : https://github.com/seatgeek/fuzzywuzzy#usage
http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
def getLongestSubstrRatio(a, b):
strs = list(distance.lcsubstrings(a, b))
if len(strs) == 0:
return 0
else:
return len(strs[0]) / (min(len(a), len(b)) + 1);
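The `distance` package is an extra pip install; the same longest-common-substring ratio can be approximated with the stdlib's difflib. This is a sketch under that substitution, not the notebook's actual dependency:

```python
from difflib import SequenceMatcher

def longest_substr_ratio(a, b):
    """Length of the longest common substring over the shorter string's length (+1, as above)."""
    if not a or not b:
        return 0
    match = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return match.size / (min(len(a), len(b)) + 1)

print(longest_substr_ratio("geologist", "biologist"))  # 0.7 -> "ologist" is 7 chars, 7 / (9 + 1)
```

The +1 in the denominator mirrors getLongestSubstrRatio above and guards against division issues on very short strings.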
def extractAdvancedFeatures(question1, question2):
q1Tokens = question1.split();
q2Tokens = question2.split();
if len(q1Tokens) == 0 or len(q2Tokens) == 0:
return [0.0]*15;
q1Words = set([token for token in q1Tokens if token not in stopWords])
q2Words = set([token for token in q2Tokens if token not in stopWords])
q1StopWords = set([token for token in q1Tokens if token in stopWords])
q2StopWords = set([token for token in q2Tokens if token in stopWords])
commonWordCount = len(q1Words & q2Words);
commonStopWordCount = len(q1StopWords & q2StopWords);
commonTokenCount = len(set(q1Tokens) & set(q2Tokens));
cwc_min = commonWordCount/(min(len(q1Words), len(q2Words)) + SAFE_DIV);
cwc_max = commonWordCount/(max(len(q1Words), len(q2Words)) + SAFE_DIV);
csc_min = commonStopWordCount/(min(len(q1StopWords), len(q2StopWords)) + SAFE_DIV);
csc_max = commonStopWordCount/(max(len(q1StopWords), len(q2StopWords)) + SAFE_DIV);
ctc_min = commonTokenCount/(min(len(q1Tokens), len(q2Tokens)) + SAFE_DIV);
ctc_max = commonTokenCount/(max(len(q1Tokens), len(q2Tokens)) + SAFE_DIV);
lastWordEq = int(q1Tokens[-1] == q2Tokens[-1]);
firstWordEq = int(q1Tokens[0] == q2Tokens[0]);
absLenDiff = abs(len(q1Tokens) - len(q2Tokens));
meanLen = (len(q1Tokens) + len(q2Tokens))/2;
fuzzRatio = fuzz.QRatio(question1, question2);
fuzzPartialRatio = fuzz.partial_ratio(question1, question2);
fuzzTokenSortRatio = fuzz.token_sort_ratio(question1, question2);
fuzzTokenSetRatio = fuzz.token_set_ratio(question1, question2);
longestSubStrRatio = getLongestSubstrRatio(question1, question2);
advancedFeatures = (cwc_min, cwc_max, csc_min, csc_max, ctc_min, ctc_max, lastWordEq, firstWordEq, absLenDiff, meanLen,\
fuzzRatio, fuzzPartialRatio, fuzzTokenSortRatio, fuzzTokenSetRatio, longestSubStrRatio);
return advancedFeatures;
processingData[['cwc_min', 'cwc_max', 'csc_min', 'csc_max', 'ctc_min', 'ctc_max', 'last_word_eq', 'first_word_eq',\
'abs_len_diff', 'mean_len', 'fuzz_ratio', 'fuzz_partial_ratio', 'token_sort_ratio', 'token_set_ratio', 'longest_substr_ratio']]\
= processingData.apply(lambda x: pd.Series(extractAdvancedFeatures(x['question1'], x['question2'])), axis = 1);
preProcessedData = pd.read_csv('drive/My Drive/data_with_all_features.csv');
duplicatedData = preProcessedData[preProcessedData['is_duplicate'] == 1]
nonDuplicatedData = preProcessedData[preProcessedData['is_duplicate'] == 0]
print('Number of duplicated data points: ', duplicatedData.shape[0]);
print('Number of non-duplicated data points: ', nonDuplicatedData.shape[0]);
duplicatedDataText = []
nonDuplicatedDataText = []
for i, dataPoint in duplicatedData.iterrows():
duplicatedDataText.extend(str(dataPoint['question1']).split() + str(dataPoint['question2']).split());
for i, dataPoint in nonDuplicatedData.iterrows():
nonDuplicatedDataText.extend(str(dataPoint['question1']).split() + str(dataPoint['question2']).split());
print('Number of words in duplicate pair questions: ', len(duplicatedDataText));
print('Number of words in non-duplicate pair questions: ', len(nonDuplicatedDataText));
stopWords = set(stopwords.words('english'))
stopWords.add("said")
stopWords.add("br")
stopWords.add(" ")
stopWords.remove("not");
stopWords.remove("no");
wordcloud = WordCloud(stopwords = stopWords, max_words = len(duplicatedDataText), background_color = 'white');
wordcloud.generate(' '.join(duplicatedDataText));
plt.figure(figsize = (12, 12));
plt.imshow(wordcloud, interpolation = 'bilinear');
plt.axis('off');
plt.title("Word Cloud of Duplicated Pair Questions");
plt.show()
wordcloud = WordCloud(stopwords = stopWords, max_words = len(nonDuplicatedDataText), background_color = 'white');
wordcloud.generate(' '.join(nonDuplicatedDataText));
plt.figure(figsize = (12, 12));
plt.imshow(wordcloud, interpolation = 'bilinear');
plt.axis('off');
plt.title("Word Cloud of Non-duplicated Pair Questions");
plt.show()
sbrn.pairplot(preProcessedData.iloc[0:50000], hue='is_duplicate', vars=['ctc_min', 'cwc_min', 'csc_min', 'token_sort_ratio', 'fuzz_ratio', 'fuzz_partial_ratio', 'token_set_ratio'], palette = 'husl');
plt.show()
plt.figure(figsize=(10, 6));
plt.subplot(1, 2, 1);
sbrn.violinplot(x = 'is_duplicate', y = 'token_sort_ratio', data = preProcessedData[0:]);
plt.title('Box plot of token sort ratio');
plt.show();
plt.figure(figsize=(10, 6));
plt.subplot(1, 2, 1);
sbrn.violinplot(x = 'is_duplicate', y = 'token_set_ratio', data = preProcessedData[0:]);
plt.title('Box plot of token set ratio');
plt.show();
plt.figure(figsize=(10, 6));
plt.subplot(1, 2, 1);
sbrn.violinplot(x = 'is_duplicate', y = 'fuzz_ratio', data = preProcessedData[0:]);
plt.title('Box plot of fuzz ratio');
plt.show();
plt.figure(figsize=(10, 6));
plt.subplot(1, 2, 1);
sbrn.violinplot(x = 'is_duplicate', y = 'fuzz_partial_ratio', data = preProcessedData[0:]);
plt.title('Box plot of fuzz partial ratio');
plt.show();
preProcessedSubSampled = preProcessedData.iloc[0:5000]
perplexities = [20, 30, 50, 80]
iterations = [1000]
X = MinMaxScaler().fit_transform(preProcessedSubSampled[['cwc_min', 'cwc_max', 'csc_min', 'csc_max' , 'ctc_min' , 'ctc_max' , 'last_word_eq', 'first_word_eq' , 'abs_len_diff' , 'mean_len' , 'token_set_ratio' , 'token_sort_ratio' , 'fuzz_ratio' , 'fuzz_partial_ratio' , 'longest_substr_ratio']])
y = preProcessedSubSampled['is_duplicate'].values
for perplexity in perplexities:
for numIterations in iterations:
tsneData = TSNE(n_components = 2, perplexity = perplexity, n_iter = numIterations, verbose = 2, random_state = 101, angle = 0.5)\
.fit_transform(X);
dataFrame = pd.DataFrame({'x': tsneData[:, 0], 'y': tsneData[:, 1], 'label': y});
sbrn.lmplot(data = dataFrame, x = 'x', y = 'y', hue = 'label', fit_reg = False, height = 8, palette = "Set1", markers = ['s', 'o'])
plt.title("perplexity : {} and max_iter : {}".format(perplexity, numIterations))
plt.show()
numericalFeatures = ['freq_qid1', 'freq_qid2', 'q1len', 'q1_n_words', 'q2_n_words', 'word_common', \
'word_total', 'word_share', 'sum_freq', 'diff_freq', 'cwc_min', 'cwc_max', \
'csc_min', 'csc_max', 'ctc_min', 'ctc_max', 'last_word_eq', \
'first_word_eq', 'abs_len_diff', 'mean_len', 'fuzz_ratio', 'fuzz_partial_ratio',\
'token_sort_ratio', 'token_set_ratio', 'longest_substr_ratio'];
# Utility function that is used to print equal signs
def equalsBorder(num):
print("="*num);
preProcessedData = pd.read_csv('drive/My Drive/data_with_all_features.csv');
# Replacing nan points
preProcessedData.fillna('', inplace = True)
preProcessedData = preProcessedData[0:100000]
classesData = preProcessedData['is_duplicate']
trainData, testData, classesTrain, classesTest = model_selection.train_test_split(preProcessedData, classesData, test_size = 0.3, stratify = classesData, random_state = 42);
questions1 = trainData['question1'];
questions2 = trainData['question2'];
questions = list(questions1) + list(questions2);
questions = pd.Series(questions).fillna("").tolist()
print("Training data shape: ", trainData.shape);
print("Test data shape: ", testData.shape)
tfIdfVectorizer = TfidfVectorizer(lowercase = False, min_df = 5, max_features = 8000);
tfIdfTotalModel = tfIdfVectorizer.fit(questions);
tfIdfTrainQuestion1Model = tfIdfTotalModel.transform(trainData['question1']);
tfIdfTrainQuestion2Model = tfIdfTotalModel.transform(trainData['question2']);
tfIdfTestQuestion1Model = tfIdfTotalModel.transform(testData['question1']);
tfIdfTestQuestion2Model = tfIdfTotalModel.transform(testData['question2']);
print(tfIdfTrainQuestion1Model.shape)
print(tfIdfTrainQuestion2Model.shape)
print(tfIdfTestQuestion1Model.shape)
print(tfIdfTestQuestion2Model.shape)
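The fit/transform pattern above (fit one vocabulary on the combined questions, then transform each side separately so both matrices share the same columns) in miniature, on hypothetical strings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

questions1 = ["what is machine learning", "how do i invest in shares"]
questions2 = ["what is deep learning", "how can i invest in stocks"]

# One shared vocabulary for both question columns
vectorizer = TfidfVectorizer().fit(questions1 + questions2)
m1 = vectorizer.transform(questions1)
m2 = vectorizer.transform(questions2)
print(m1.shape, m2.shape)  # same number of columns, so the matrices can be hstack-ed later
```

Fitting separate vectorizers on question1 and question2 would give incompatible column spaces, which is why a single fit on the concatenated list is used.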
# This function plots the confusion matrices given y_i, y_i_hat.
def plot_confusion_matrix(test_y, predict_y):
C = confusion_matrix(test_y, predict_y)
# Recall matrix: each row of C divided by its row sum
A = (C.T / C.sum(axis=1)).T
# Precision matrix: each column of C divided by its column sum
B = C / C.sum(axis=0)
plt.figure(figsize=(20, 4))
labels = [0, 1]
cmap = sbrn.light_palette("blue")
plt.subplot(1, 3, 1)
sbrn.heatmap(C, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted Class')
plt.ylabel('Original Class')
plt.title("Confusion matrix")
plt.subplot(1, 3, 2)
sbrn.heatmap(B, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted Class')
plt.ylabel('Original Class')
plt.title("Precision matrix")
plt.subplot(1, 3, 3)
# representing A (the recall matrix) in heatmap format
sbrn.heatmap(A, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted Class')
plt.ylabel('Original Class')
plt.title("Recall matrix")
plt.show()
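The normalizations inside plot_confusion_matrix are just row- and column-divisions of the confusion matrix. A small numpy check on toy counts:

```python
import numpy as np

C = np.array([[50, 10],
              [5, 35]])  # rows: true class, columns: predicted class

recall = (C.T / C.sum(axis=1)).T  # each row sums to 1: per-class recall
precision = C / C.sum(axis=0)     # each column sums to 1: per-class precision

print(np.round(recall, 3))
print(np.round(precision, 3))
```

Reading the heatmaps later: the diagonal of the recall matrix tells you what fraction of each true class was found, and the diagonal of the precision matrix tells you how trustworthy each predicted class is.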
techniques = ['Tf-Idf'];
for index, technique in enumerate(techniques):
trainingMergedData = hstack((trainData[numericalFeatures],\
tfIdfTrainQuestion1Model,\
tfIdfTrainQuestion2Model))
testMergedData = hstack((testData[numericalFeatures],\
tfIdfTestQuestion1Model,\
tfIdfTestQuestion2Model))
lrClassifier = LogisticRegression(penalty = 'l2');
tunedParameters = {'C': [0.0001, 0.01, 0.1, 1, 10, 100, 10000]};
classifier = GridSearchCV(lrClassifier, tunedParameters, cv = 5, scoring = 'neg_log_loss', return_train_score = True);
classifier.fit(trainingMergedData, classesTrain);
crossValidateLogLossMeanValues = classifier.cv_results_['mean_test_score'];
crossValidateLogLossStdValues = classifier.cv_results_['std_test_score'];
plt.plot(tunedParameters['C'], crossValidateLogLossMeanValues, label = "Cross Validate Log-Loss");
plt.scatter(tunedParameters['C'], crossValidateLogLossMeanValues, label = 'Cross Validate Log-Loss values');
plt.gca().fill_between(tunedParameters['C'], crossValidateLogLossMeanValues - crossValidateLogLossStdValues, crossValidateLogLossMeanValues + crossValidateLogLossStdValues, alpha = 0.2, color = 'darkorange');
plt.xlabel('Hyper parameter: C values');
plt.ylabel('Scoring: Log-Loss values');
plt.grid();
plt.legend();
plt.show();
optimalHypParam1Value = classifier.best_params_['C'];
lrClassifier = LogisticRegression(penalty = 'l2', C = optimalHypParam1Value);
calibratedLrClassifier = CalibratedClassifierCV(base_estimator = lrClassifier, method = 'sigmoid');
calibratedLrClassifier.fit(trainingMergedData, classesTrain);
predProbScoresTraining = calibratedLrClassifier.predict_proba(trainingMergedData);
predProbScoresTest = calibratedLrClassifier.predict_proba(testMergedData);
print("Results of analysis using {} vectorized text merged with other features using logistic regression classifier: ".format(technique));
equalsBorder(70);
print("Optimal C Value: ", optimalHypParam1Value);
equalsBorder(40);
print("Log-Loss obtained: ", log_loss(classesTrain, predProbScoresTraining))
# Predicting classes of test data projects
predictedScoresTest =np.argmax(predProbScoresTest,axis=1)
equalsBorder(70);
plot_confusion_matrix(classesTest, predictedScoresTest);
techniques = ['Tf-Idf'];
for index, technique in enumerate(techniques):
trainingMergedData = hstack((trainData[numericalFeatures],\
tfIdfTrainQuestion1Model,\
tfIdfTrainQuestion2Model))
testMergedData = hstack((testData[numericalFeatures],\
tfIdfTestQuestion1Model,\
tfIdfTestQuestion2Model))
tunedParameters = {'alpha': [0.0001, 0.01, 0.1, 1, 10, 100, 10000]};
parameters = {};
cv_results_ = {'mean_test_score': [], 'std_test_score': []}
for alpha in tunedParameters['alpha']:
svmClassifier = linear_model.SGDClassifier(loss = 'hinge', class_weight = 'balanced', alpha = alpha);
calibratedSVMClassifier = CalibratedClassifierCV(base_estimator = svmClassifier, method = 'sigmoid');
classifier = GridSearchCV(calibratedSVMClassifier, parameters, cv = 5, scoring = 'neg_log_loss', return_train_score = True);
classifier.fit(trainingMergedData, classesTrain);
cv_results_['mean_test_score'].append(classifier.cv_results_['mean_test_score'][0]);
cv_results_['std_test_score'].append(classifier.cv_results_['std_test_score'][0]);
crossValidateLogLossMeanValues = np.array(cv_results_['mean_test_score']);
crossValidateLogLossStdValues = np.array(cv_results_['std_test_score']);
plt.plot(tunedParameters['alpha'], crossValidateLogLossMeanValues, label = "Cross Validate Log-Loss");
plt.scatter(tunedParameters['alpha'], crossValidateLogLossMeanValues, label = 'Cross Validate Log-Loss values');
plt.gca().fill_between(tunedParameters['alpha'], crossValidateLogLossMeanValues - crossValidateLogLossStdValues, crossValidateLogLossMeanValues + crossValidateLogLossStdValues, alpha = 0.2, color = 'darkorange');
plt.xlabel('Hyper parameter: alpha values');
plt.ylabel('Scoring: Log-Loss values');
plt.grid();
plt.legend();
plt.show();
optimalHypParam1Value = tunedParameters['alpha'][np.argmax(cv_results_['mean_test_score'])];
svmClassifier = linear_model.SGDClassifier(penalty = 'l2', alpha = optimalHypParam1Value);
calibratedSVMClassifier = CalibratedClassifierCV(base_estimator = svmClassifier, method = 'sigmoid');
calibratedSVMClassifier.fit(trainingMergedData, classesTrain);
predProbScoresTraining = calibratedSVMClassifier.predict_proba(trainingMergedData);
predProbScoresTest = calibratedSVMClassifier.predict_proba(testMergedData);
print("Results of analysis using {} vectorized text merged with other features using simple vector machine classifier: ".format(technique));
equalsBorder(70);
print("Optimal alpha Value: ", optimalHypParam1Value);
equalsBorder(40);
print("Log-Loss obtained: ", log_loss(classesTrain, predProbScoresTraining))
# Predicting classes of test data projects
predictedScoresTest =np.argmax(predProbScoresTest,axis=1)
equalsBorder(70);
plot_confusion_matrix(classesTest, predictedScoresTest);
techniques = ['Tf-Idf'];
for index, technique in enumerate(techniques):
trainingMergedData = hstack((trainData[numericalFeatures],\
tfIdfTrainQuestion1Model,\
tfIdfTrainQuestion2Model))
testMergedData = hstack((testData[numericalFeatures],\
tfIdfTestQuestion1Model,\
tfIdfTestQuestion2Model))
xgbClassifier = xgb.XGBClassifier(n_jobs = -1, subsample = 0.5, colsample_bytree = 0.5, scale_pos_weight = 0.18);
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 1000, num = 10)]
max_depth = [int(x) for x in np.linspace(1, 15, num = 11)]
learning_rate = [0.1, 0.3, 0.5, 0.8, 1]
randomGrid = {'n_estimators': n_estimators, 'max_depth': max_depth, 'learning_rate': learning_rate}
randomizedClassifier = model_selection.RandomizedSearchCV(estimator = xgbClassifier, param_distributions = randomGrid, n_iter = 50,\
cv = 3, verbose = 2, random_state = 42, n_jobs = -1, scoring = 'neg_log_loss');
randomizedClassifier.fit(trainingMergedData, classesTrain);
crossValidateLogLossMeanValues = randomizedClassifier.cv_results_['mean_test_score'];
crossValidateLogLossStdValues = randomizedClassifier.cv_results_['std_test_score'];
print("Cross-Validation log-loss values: ", crossValidateLogLossMeanValues);
bestnEstimatorsValue = randomizedClassifier.best_params_['n_estimators'];
bestMaxDepthValue = randomizedClassifier.best_params_['max_depth'];
bestLearningRate = randomizedClassifier.best_params_['learning_rate'];
xgbClassifier = xgb.XGBClassifier(n_jobs = -1, subsample = 0.5, colsample_bytree = 0.5, scale_pos_weight = 0.18,\
n_estimators = bestnEstimatorsValue, max_depth = bestMaxDepthValue, learning_rate = bestLearningRate);
calibratedXgbClassifier = CalibratedClassifierCV(base_estimator = xgbClassifier, method = 'sigmoid');
calibratedXgbClassifier.fit(trainingMergedData, classesTrain);
predProbScoresTraining = calibratedXgbClassifier.predict_proba(trainingMergedData);
predProbScoresTest = calibratedXgbClassifier.predict_proba(testMergedData);
print("Results of analysis using {} vectorized text merged with other features using xgb classifier: ".format(technique));
equalsBorder(70);
print("Optimal n_estimators Value: ", bestnEstimatorsValue);
equalsBorder(40);
print("Optimal max_depth Value: ", bestMaxDepthValue);
equalsBorder(40);
print("Optimal learning_rate Value: ", bestLearningRate);
equalsBorder(40);
print("Log-Loss obtained: ", log_loss(classesTrain, predProbScoresTraining))
# Predicting classes of test data projects
predictedScoresTest =np.argmax(predProbScoresTest,axis=1)
equalsBorder(70);
plot_confusion_matrix(classesTest, predictedScoresTest);
randomizedClassifier.cv_results_
resultsDataFrame = pd.DataFrame([
    {'Estimator': 'Logistic Regression', 'Vectorizer': 'Tf-Idf', 'Hyper Parameters': 'C - 1', 'Log-Loss': '0.3301'},
    {'Estimator': 'Linear SVM', 'Vectorizer': 'Tf-Idf', 'Hyper Parameters': 'Alpha - 0.0001', 'Log-Loss': '0.4011'},
    {'Estimator': 'XGBoost', 'Vectorizer': 'Tf-Idf', 'Hyper Parameters': 'n_estimators-1000, max_depth-10, learning_rate-0.1', 'Log-Loss': '0.2104'},
]);
resultsDataFrame